Antonio Debouse, Blake Freeman, Bodie Franklin, Eric Romero
Credit card companies are always in search of better ways to monitor borrowers to determine if the credit card borrower will default on their credit card payments or make them in full. Defaulted credit card payments are often difficult to recoup and create losses for these companies. Defaulting on a payment is defined as not meeting the debt obligation (which is the credit card payment). Our dataset is composed of 24 attributes and 30,000 records that reflect a Taiwanese credit card borrower’s payment history over a six month period.The data was pulled from UCI machine learning repository. The purpose of the dataset is to provide attributes at different points in their payment history to identify if a credit card borrower will default on their payments or pay in full. Since the dataset captures six payment periods, it gives the credit card firm a chance to identify if default will occur or not in various billing cycles. The effectiveness of a good classification algorithm is one that produces strong accuracy, sensitivity, and specificity scores through cross validation. If an effective classification model can be built, the credit company will have the ability to proactively monitor borrowers in various credit stages. The significance of identifying default or not will allow the credit card to minimize their losses. If early default identification occurs, the credit card company can reduce the borrower’s credit limits or preemptively work with the borrower to create new repayment plans. Both outcomes will help the credit company reduce their losses that would occur if no action were taken.
Amount of the given credit (NT dollar): nominal scale. combined total of credit (amount of money) given to the individual borrower and their family.
Gender: Categorical variable. 1 represents male and 2 represents female.
Education: ordinal scale 1 represents the highest level of education and 4 would be the lowest. 1 = graduate school, 2 = university, 3 = high school and 4 = others. Values 0,5,6 are undefined.
Marital status: Categorical variable. 1 = married, 2 = single, 3 = others. Value 0 is undefined.
Age: numerical. This attribute would be nominal. Measures how old a borrower is.
PAY_0 to PAY6: Categorical scale, these attributes describe the past monthly payment status of each made. For example, PAY_0 represents the payment status in September 2005 and PAY_6 represents the payment status in April 2005. -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
BILL_AMT1 to BILL_AMT6: nominal scale. This value represents the amount of the credit card bill in each respective month.
PAY_AMT1 to PAY_AMT2: nominal scale. This value represents the amount of the credit card bill paid in each respective month.
Default payment next month: categorical scale. 1 represents a default or missed payment. 0 represents payment made.
#Imports
import pandas as pd
import plotly as py
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# import of the Data set
bank_db = 'C:/Users/blake/OneDrive/Desktop/MSDS 7331/Project1/Data/bank_dataset.csv'
#Reading it in
bank_db_df = pd.read_csv(bank_db)
The data that was pulled was fairly clean to start when reviewing this data. However we did notice some factors that were not in the defined range of the data. This was apparent in categorical columns of Education and Marriage. Education had 3 additional values of 0, 5, 6 which occurred 345 times out of the 30,000 values in this column. Marriage had a value of 0 which occurred 54 times out of the 30,000 values. We addressed both these mistakes in the data by adding them to the “other” column that is denoted in each category. We decided to include these variables since it looked like a misclassification of the data type.
# First look at the Data Set
bank_db_df.head()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
5 rows × 25 columns
# Documentaion of the data set from'https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients'
# Noted below on the catagorical variables UCI stated
# Gender (1 = male; 2 = female)
Gender = bank_db_df.groupby('SEX')['ID'].nunique()
# Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
EDUCATION = bank_db_df.groupby('EDUCATION')['ID'].nunique()
# Marital status (1 = married; 2 = single; 3 = others)
MARRIAGE = bank_db_df.groupby('MARRIAGE')['ID'].nunique()
print(Gender)
print(EDUCATION)
print(MARRIAGE)
#based on this documentaion there are addtional variables in Education and Marital Status
### purposed code from Stack Overflow for this.
SEX 1 11888 2 18112 Name: ID, dtype: int64 EDUCATION 0 14 1 10585 2 14030 3 4917 4 123 5 280 6 51 Name: ID, dtype: int64 MARRIAGE 0 54 1 13659 2 15964 3 323 Name: ID, dtype: int64
#Group by balances
df_grouped = bank_db_df.groupby(by='default payment next month')
print(df_grouped.BILL_AMT1.mean())
print('--------')
print(df_grouped.BILL_AMT2.mean())
print('--------')
print(df_grouped.BILL_AMT3.mean())
print('--------')
print(df_grouped.BILL_AMT4.mean())
print('--------')
print(df_grouped.BILL_AMT5.mean())
print('--------')
print(df_grouped.BILL_AMT6.mean())
print('----------------------')
print(df_grouped.PAY_AMT1.mean())
print('--------')
print(df_grouped.PAY_AMT2.mean())
print('--------')
print(df_grouped.PAY_AMT3.mean())
print('--------')
print(df_grouped.PAY_AMT4.mean())
print('--------')
print(df_grouped.PAY_AMT5.mean())
print('--------')
print(df_grouped.PAY_AMT6.mean())
#Interesting that the mean pmts made/due are lower than people who defaulted.
default payment next month 0 51994.227273 1 48509.162297 Name: BILL_AMT1, dtype: float64 -------- default payment next month 0 49717.435670 1 47283.617842 Name: BILL_AMT2, dtype: float64 -------- default payment next month 0 47533.365605 1 45181.598855 Name: BILL_AMT3, dtype: float64 -------- default payment next month 0 43611.165254 1 42036.950573 Name: BILL_AMT4, dtype: float64 -------- default payment next month 0 40530.445343 1 39540.190476 Name: BILL_AMT5, dtype: float64 -------- default payment next month 0 39042.268704 1 38271.435503 Name: BILL_AMT6, dtype: float64 ---------------------- default payment next month 0 6307.337357 1 3397.044153 Name: PAY_AMT1, dtype: float64 -------- default payment next month 0 6640.465074 1 3388.649638 Name: PAY_AMT2, dtype: float64 -------- default payment next month 0 5753.496833 1 3367.351567 Name: PAY_AMT3, dtype: float64 -------- default payment next month 0 5300.529319 1 3155.626733 Name: PAY_AMT4, dtype: float64 -------- default payment next month 0 5248.220296 1 3219.139542 Name: PAY_AMT5, dtype: float64 -------- default payment next month 0 5719.371769 1 3441.482068 Name: PAY_AMT6, dtype: float64
# Group-By for the data by the Catagorical Variables - Noticed there were groups that were not defined by the data set
bank_db_df.groupby(['SEX','EDUCATION','MARRIAGE']).mean()
| ID | LIMIT_BAL | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SEX | EDUCATION | MARRIAGE | |||||||||||||||||||||
| 1 | 0 | 1 | 27213.500000 | 205000.000000 | 41.000000 | -1.500000 | -1.500000 | -1.500000 | -1.500000 | -1.000000 | -1.500000 | 11612.500000 | ... | 20295.500000 | 1926.500000 | 2901.500000 | 4345.000000 | 7876.500000 | 20354.500000 | 38.500000 | 2901.500000 | 2344.500000 | 0.000000 |
| 2 | 14308.000000 | 215000.000000 | 36.500000 | 0.000000 | -0.833333 | -1.166667 | -1.000000 | -1.166667 | -1.666667 | 11051.666667 | ... | 10767.333333 | 7372.166667 | 934.500000 | 9517.166667 | 14631.666667 | 7810.166667 | 5491.333333 | 1225.833333 | 3113.500000 | 0.000000 | ||
| 1 | 0 | 24444.000000 | 300000.000000 | 37.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 0.000000 | 0.000000 | 298.000000 | ... | 1576.000000 | 497.000000 | -3.000000 | 752.000000 | 179.000000 | 1579.000000 | 1.000000 | 0.000000 | 1246.000000 | 0.000000 | |
| 1 | 14549.651479 | 262485.207101 | 42.423077 | -0.220118 | -0.418935 | -0.459763 | -0.495266 | -0.521893 | -0.528402 | 60196.886391 | ... | 51360.938462 | 48348.436095 | 46303.767456 | 7430.248521 | 8108.434911 | 8130.040828 | 7016.669822 | 6897.355621 | 7466.460355 | 0.230769 | ||
| 2 | 14363.201671 | 192085.074060 | 31.704899 | -0.147360 | -0.266996 | -0.273073 | -0.320927 | -0.345993 | -0.362704 | 54064.399164 | ... | 46617.539309 | 42545.724269 | 40471.004558 | 6866.550323 | 6717.813141 | 6578.889480 | 5373.064565 | 5485.293581 | 6514.678314 | 0.192176 | ||
| 3 | 11595.233333 | 165333.333333 | 44.533333 | -0.033333 | -0.166667 | 0.000000 | -0.100000 | -0.133333 | -0.266667 | 64661.733333 | ... | 57469.500000 | 52784.100000 | 53506.600000 | 5375.300000 | 7199.600000 | 2644.600000 | 2719.366667 | 8242.933333 | 3479.766667 | 0.333333 | ||
| 2 | 0 | 24722.000000 | 360000.000000 | 38.000000 | -1.000000 | -1.000000 | -1.000000 | -2.000000 | -2.000000 | -2.000000 | 176.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 252.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | |
| 1 | 15296.578903 | 162101.265823 | 40.332489 | 0.115612 | 0.057806 | 0.008439 | -0.063291 | -0.117722 | -0.177215 | 56465.218143 | ... | 47514.228270 | 44393.006329 | 43099.936287 | 5455.835021 | 5856.121519 | 4710.043460 | 4800.077637 | 4719.933333 | 4900.730802 | 0.271308 | ||
| 2 | 14656.010204 | 112442.176871 | 31.584354 | 0.272789 | 0.239456 | 0.198639 | 0.113605 | 0.020748 | -0.038776 | 50926.275850 | ... | 41269.722449 | 37845.888435 | 36963.920748 | 4416.209524 | 4920.652041 | 4330.766667 | 4031.282993 | 3906.684694 | 4146.418027 | 0.253741 | ||
| 3 | 15215.031746 | 70952.380952 | 40.698413 | 0.142857 | -0.095238 | -0.063492 | 0.111111 | -0.015873 | -0.190476 | 35491.222222 | ... | 25978.333333 | 21556.095238 | 20738.285714 | 3281.158730 | 4496.920635 | 2394.793651 | 1864.380952 | 1799.079365 | 1633.000000 | 0.301587 | ||
| 3 | 0 | 11619.916667 | 96666.666667 | 38.000000 | 0.250000 | 0.250000 | 0.083333 | -0.250000 | -0.083333 | -0.333333 | 26227.916667 | ... | 21236.000000 | 16615.416667 | 15217.166667 | 10186.583333 | 3416.166667 | 6472.500000 | 3102.583333 | 1941.166667 | 4646.500000 | 0.166667 | |
| 1 | 14847.441794 | 140143.129771 | 44.338740 | 0.189885 | 0.131679 | 0.071565 | 0.000954 | -0.088740 | -0.111641 | 55910.685115 | ... | 43583.296756 | 40930.285305 | 39835.865458 | 4818.151718 | 4779.375000 | 3473.604008 | 4092.454198 | 3624.915076 | 3945.881679 | 0.289122 | ||
| 2 | 15117.636465 | 100847.516779 | 35.379195 | 0.268456 | 0.171141 | 0.093960 | -0.010067 | -0.054810 | -0.099553 | 45000.682327 | ... | 35999.019016 | 33015.140940 | 32308.956376 | 4399.395973 | 4447.173378 | 4156.166667 | 3555.214765 | 3923.678971 | 3642.455257 | 0.256152 | ||
| 3 | 13351.583333 | 69166.666667 | 47.722222 | 0.361111 | 0.138889 | 0.083333 | 0.027778 | -0.055556 | 0.000000 | 38469.972222 | ... | 24827.138889 | 24240.166667 | 22179.444444 | 2243.388889 | 4303.000000 | 2626.055556 | 2512.388889 | 1709.083333 | 1630.222222 | 0.305556 | ||
| 4 | 1 | 17826.722222 | 230000.000000 | 41.777778 | -0.222222 | -0.444444 | -0.444444 | -0.444444 | -0.333333 | -0.222222 | 76830.777778 | ... | 54168.444444 | 48627.277778 | 44930.000000 | 6911.166667 | 5409.555556 | 21159.555556 | 4728.555556 | 5367.388889 | 4589.555556 | 0.111111 | |
| 2 | 17408.521739 | 173478.260870 | 28.173913 | -0.478261 | -0.608696 | -0.608696 | -0.739130 | -0.608696 | -0.608696 | 64118.739130 | ... | 41932.000000 | 40017.826087 | 38676.000000 | 6991.782609 | 10154.260870 | 12042.173913 | 3885.695652 | 5578.782609 | 2044.608696 | 0.086957 | ||
| 3 | 23938.000000 | 150000.000000 | 27.000000 | -1.000000 | -1.000000 | 0.000000 | 0.000000 | -2.000000 | -2.000000 | 3462.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 1147.000000 | 6000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ||
| 5 | 1 | 20020.500000 | 209916.666667 | 39.354167 | -0.208333 | -0.395833 | -0.395833 | -0.437500 | -0.416667 | -0.479167 | 104611.791667 | ... | 78269.437500 | 66039.875000 | 49827.291667 | 8338.812500 | 9751.416667 | 9241.333333 | 4692.562500 | 5662.333333 | 11141.145833 | 0.083333 | |
| 2 | 18037.065217 | 107608.695652 | 32.630435 | 0.282609 | -0.021739 | -0.282609 | -0.304348 | -0.369565 | -0.565217 | 74828.086957 | ... | 53557.782609 | 45680.934783 | 37396.521739 | 6587.586957 | 6988.543478 | 3863.065217 | 5926.239130 | 1949.195652 | 6491.500000 | 0.043478 | ||
| 3 | 27406.000000 | 100000.000000 | 44.000000 | 2.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 40199.000000 | ... | 30456.000000 | 29420.000000 | 27102.000000 | 0.000000 | 2005.000000 | 1700.000000 | 976.000000 | 2000.000000 | 3000.000000 | 0.000000 | ||
| 6 | 1 | 19884.500000 | 165714.285714 | 48.000000 | -0.714286 | -0.785714 | -0.785714 | -0.785714 | -0.857143 | -0.857143 | 78274.500000 | ... | 34296.285714 | 22276.857143 | 18218.000000 | 11288.214286 | 11468.571429 | 14158.928571 | 6087.000000 | 3178.000000 | 21956.428571 | 0.285714 | |
| 2 | 17602.545455 | 146363.636364 | 40.363636 | -0.090909 | -0.272727 | -0.454545 | -0.363636 | -0.545455 | -0.909091 | 37394.545455 | ... | 30280.272727 | 27390.272727 | 42439.818182 | 6263.636364 | 4687.000000 | 5460.090909 | 5968.909091 | 23591.545455 | 12306.818182 | 0.000000 | ||
| 2 | 0 | 1 | 21578.000000 | 135000.000000 | 37.500000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | 22374.500000 | ... | 23983.000000 | 20409.500000 | 21120.000000 | 1274.000000 | 6432.500000 | 1209.500000 | 1409.500000 | 1120.000000 | 5426.000000 | 0.000000 |
| 2 | 13962.000000 | 267500.000000 | 42.000000 | -0.500000 | -1.000000 | -0.250000 | -0.250000 | -1.000000 | -1.000000 | 7662.250000 | ... | 8435.500000 | 3705.500000 | 5042.750000 | 3725.000000 | 16506.250000 | 8390.750000 | 3711.000000 | 5046.500000 | 1969.750000 | 0.000000 | ||
| 1 | 0 | 13005.000000 | 160000.000000 | 36.333333 | -1.000000 | -1.000000 | -1.333333 | -1.000000 | -0.666667 | 0.333333 | 3148.666667 | ... | 26227.666667 | 38664.333333 | 37926.666667 | 1793.000000 | 11485.666667 | 16227.666667 | 15331.000000 | 425.666667 | 616.333333 | 0.000000 | |
| 1 | 15033.563976 | 245319.881890 | 38.870571 | -0.360236 | -0.622539 | -0.651575 | -0.664370 | -0.691929 | -0.687992 | 44337.774114 | ... | 40539.821358 | 38675.998524 | 36771.830709 | 7458.217028 | 9112.124508 | 6861.565945 | 6597.845965 | 6165.833661 | 6672.745571 | 0.186516 | ||
| 2 | 14823.686063 | 190821.360153 | 30.158046 | -0.234195 | -0.387452 | -0.399904 | -0.438697 | -0.442050 | -0.449473 | 43004.106082 | ... | 38311.943008 | 36581.905172 | 35304.199473 | 5955.380508 | 6170.817289 | 5543.457136 | 5075.066092 | 5307.253352 | 5779.649425 | 0.179119 | ||
| 3 | 14976.950000 | 184000.000000 | 38.150000 | -0.050000 | -0.850000 | -0.650000 | -0.950000 | -1.000000 | -0.700000 | 55192.750000 | ... | 35471.150000 | 36449.350000 | 30919.550000 | 47363.950000 | 70677.250000 | 58001.800000 | 35400.100000 | 5248.650000 | 20447.700000 | 0.150000 | ||
| 2 | 0 | 13161.800000 | 166000.000000 | 35.800000 | 0.000000 | 0.400000 | 0.000000 | 0.000000 | -0.400000 | -0.200000 | 31881.400000 | ... | 33039.000000 | 32055.400000 | 35445.800000 | 8286.000000 | 1914.000000 | 1355.200000 | 1046.000000 | 7170.200000 | 890.800000 | 0.200000 | |
| 1 | 15272.627013 | 164180.679785 | 37.527504 | 0.016324 | -0.087209 | -0.140877 | -0.202594 | -0.247317 | -0.256708 | 54135.562835 | ... | 45074.861807 | 42022.473390 | 40319.405188 | 5330.398927 | 4873.439401 | 4789.545394 | 4393.264758 | 4349.698122 | 5000.103757 | 0.228757 | ||
| 2 | 14888.093137 | 146840.686275 | 30.405637 | 0.062500 | -0.035539 | -0.060539 | -0.112500 | -0.160294 | -0.173529 | 53866.076961 | ... | 45934.761520 | 42826.745343 | 42034.511765 | 5083.827206 | 5131.499265 | 4450.631373 | 4406.327451 | 4794.213235 | 4812.797549 | 0.213971 | ||
| 3 | 13705.131313 | 96363.636364 | 40.868687 | 0.222222 | 0.151515 | 0.181818 | 0.111111 | -0.010101 | 0.080808 | 43196.595960 | ... | 31244.808081 | 28619.343434 | 31782.494949 | 5423.848485 | 2808.292929 | 3046.222222 | 4155.010101 | 6441.959596 | 2657.585859 | 0.252525 | ||
| 3 | 0 | 17418.250000 | 126562.500000 | 38.218750 | -0.218750 | -0.343750 | -0.187500 | -0.343750 | -0.468750 | -0.531250 | 19299.093750 | ... | 16440.750000 | 15021.406250 | 13575.656250 | 7840.125000 | 2032.031250 | 2587.531250 | 2069.156250 | 1696.937500 | 1798.937500 | 0.062500 | |
| 1 | 15914.895201 | 129387.755102 | 42.356315 | 0.076117 | -0.008274 | -0.027027 | -0.093767 | -0.176503 | -0.225593 | 45768.902372 | ... | 38414.981247 | 35702.306674 | 34083.664093 | 4464.586873 | 5389.212907 | 3651.939879 | 4267.031991 | 3440.367899 | 3539.447876 | 0.248759 | ||
| 2 | 15262.420690 | 135674.876847 | 36.240394 | 0.054187 | -0.080788 | -0.099507 | -0.132020 | -0.187192 | -0.258128 | 46695.179310 | ... | 38887.507389 | 35997.965517 | 34757.061084 | 6037.979310 | 5591.176355 | 5028.606897 | 3967.595074 | 3821.534975 | 4614.417734 | 0.219704 | ||
| 3 | 12542.238806 | 78059.701493 | 46.059701 | 0.179104 | 0.149254 | 0.089552 | -0.029851 | -0.223881 | -0.074627 | 35120.835821 | ... | 25805.328358 | 23484.611940 | 22727.223881 | 4012.880597 | 2338.238806 | 2317.985075 | 3097.313433 | 2051.865672 | 2192.597015 | 0.238806 | ||
| 4 | 1 | 18767.176471 | 247941.176471 | 37.941176 | -0.823529 | -1.205882 | -1.205882 | -1.176471 | -1.117647 | -1.147059 | 36574.382353 | ... | 33560.823529 | 24529.735294 | 24962.764706 | 3011.352941 | 5080.441176 | 7664.852941 | 3721.617647 | 3982.411765 | 3376.235294 | 0.029412 | |
| 2 | 17766.622222 | 222000.000000 | 30.355556 | -0.400000 | -0.622222 | -0.644444 | -0.733333 | -0.800000 | -0.688889 | 58092.511111 | ... | 38999.444444 | 33669.911111 | 30568.333333 | 6167.311111 | 6360.066667 | 6194.133333 | 7152.844444 | 7972.822222 | 6170.933333 | 0.044444 | ||
| 3 | 24423.500000 | 235000.000000 | 40.500000 | 0.000000 | -1.500000 | -1.000000 | -1.000000 | 0.000000 | -0.500000 | 442.500000 | ... | 15816.000000 | 8738.000000 | 15073.500000 | 2069.500000 | 5204.000000 | 15832.000000 | 2500.000000 | 8934.000000 | 2464.500000 | 0.000000 | ||
| 5 | 1 | 16835.029412 | 185784.313725 | 37.725490 | -0.235294 | -0.411765 | -0.460784 | -0.421569 | -0.431373 | -0.470588 | 79611.725490 | ... | 69553.529412 | 60851.901961 | 55692.274510 | 5949.441176 | 10857.911765 | 8148.911765 | 4519.970588 | 5952.843137 | 9212.509804 | 0.049020 | |
| 2 | 16274.679012 | 157901.234568 | 32.234568 | -0.185185 | -0.308642 | -0.320988 | -0.333333 | -0.345679 | -0.592593 | 73882.592593 | ... | 47816.987654 | 41699.111111 | 37828.271605 | 4358.839506 | 7240.703704 | 8597.432099 | 4803.333333 | 3971.358025 | 4902.888889 | 0.086420 | ||
| 3 | 15354.500000 | 110000.000000 | 37.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.500000 | 121721.000000 | ... | 109266.000000 | 56932.500000 | 9863.500000 | 4299.500000 | 5034.000000 | 5308.500000 | 15360.000000 | 2492.000000 | 1499.000000 | 0.000000 | ||
| 6 | 1 | 16326.214286 | 153571.428571 | 45.000000 | -0.071429 | -0.071429 | -0.071429 | -0.071429 | -0.214286 | -0.428571 | 85957.571429 | ... | 68506.928571 | 59849.857143 | 55758.285714 | 3936.071429 | 4103.142857 | 7060.285714 | 2364.428571 | 2318.571429 | 8970.571429 | 0.071429 | |
| 2 | 13957.400000 | 111000.000000 | 40.900000 | 0.300000 | -0.100000 | -0.200000 | -0.400000 | -0.400000 | -0.400000 | 111611.000000 | ... | 97425.200000 | 79874.800000 | 35851.500000 | 20040.600000 | 3590.900000 | 3193.000000 | 7908.200000 | 11285.400000 | 17985.900000 | 0.300000 | ||
| 3 | 10747.000000 | 185000.000000 | 42.000000 | 0.000000 | 0.000000 | 0.000000 | -0.500000 | -0.500000 | -0.500000 | 181866.000000 | ... | 10327.000000 | 10116.500000 | 78741.500000 | 8178.500000 | 4764.000000 | 416.000000 | 547.000000 | 73467.500000 | 2628.500000 | 0.000000 |
45 rows × 22 columns
# Cleaning the Data on Unrepresented Variables
#Since both Education and Marrage have a others catagory we decided to include the vaules to in the other catagory since
# We can not state the meaning of these.
bank_db_df['EDUCATION'] = bank_db_df['EDUCATION'].replace([0,5,6],4)
bank_db_df['MARRIAGE'] = bank_db_df['MARRIAGE'].replace([0],3)
#bank_db_df.EDUCATION.unique()
#bank_db_df.MARRIAGE.unique()
#Calculate % accounts that Defaulted (Default=1)
len(bank_db_df[bank_db_df["default payment next month"]==1])/len(bank_db_df)*100
22.12
#Create dataset with only default accounts
Default_accts = bank_db_df[bank_db_df['default payment next month']==1]
#Create dataset for Non-Default accounts
NonD_accts = bank_db_df[bank_db_df['default payment next month']==0]
#Citation: Drew, Jake M. “Jakemdrew/DataMiningNotebooks.” GitHub, 11 Aug. 2011, github.com/jakemdrew/DataMiningNotebooks/blob/master/02.%20DataVisualization.ipynb.
#Aggregate Default column by Categorical Variables (SEX,EDUCATION,MARRIAGE)
Cat_group = Default_accts.groupby(by=['SEX','EDUCATION','MARRIAGE'])
print ('Total in Group who Defaulted:')
print (Cat_group["default payment next month"].count())
print ('---------------------------------------------------')
print ('Percentage of the Group who Defaulted:')
print (round((Cat_group["default payment next month"].sum()/bank_db_df["default payment next month"].count()*100),3)) #group percentage of total default amount which equals 22.12
#Citation: Drew, Jake M. “Jakemdrew/DataMiningNotebooks.” GitHub, 11 Aug. 2011, github.com/jakemdrew/DataMiningNotebooks/blob/master/02.%20DataVisualization.ipynb.
Total in Group who Defaulted:
SEX EDUCATION MARRIAGE
1 1 1 390
2 506
3 10
2 1 643
2 746
3 19
3 1 303
2 229
3 13
4 1 10
2 4
2 1 1 379
2 748
3 3
2 1 1023
2 873
3 26
3 1 451
2 223
3 18
4 1 7
2 12
Name: default payment next month, dtype: int64
---------------------------------------------------
Percentage of the Group who Defaulted:
SEX EDUCATION MARRIAGE
1 1 1 1.300
2 1.687
3 0.033
2 1 2.143
2 2.487
3 0.063
3 1 1.010
2 0.763
3 0.043
4 1 0.033
2 0.013
2 1 1 1.263
2 2.493
3 0.010
2 1 3.410
2 2.910
3 0.087
3 1 1.503
2 0.743
3 0.060
4 1 0.023
2 0.040
Name: default payment next month, dtype: float64
Of all the accounts in the dataset, 22.12% defaulted (6636 total accounts). The 3.41% of married females with a university eductation who defaulted makes up the largest percentage of the defaulted group at 15.42%, while single females with a university are the second highest group to default(2.91%) accounting for 13.15% of the defaults, followed by the single female with a graduate school education default rate of 2.49% to account for 11.26% of total defaults.
The 0.013% of the single males with an "other" education group that defaulted makes up the smallest portion of the defaulted group at 5.877e-4%, second is the married females with an "other" education group who default rate is 0.023%, followed by a tie for the third lowest default rate at 0.033% between graduated educated males with an "other" martial status and married males with an "other" education.
#Create summary statistics
df_sub = bank_db_df[['SEX','EDUCATION','MARRIAGE','AGE','LIMIT_BAL']]
df_subd = Default_accts[['SEX','EDUCATION','MARRIAGE','AGE','LIMIT_BAL']]
df_subnond = NonD_accts[['SEX','EDUCATION','MARRIAGE','AGE','LIMIT_BAL']]
print (round(df_sub.describe(),2))
print ('Median values for AGE and LIMIT_BAL')
print (df_sub.median())
print('------------------------------------------------------')
print('AGE and Limit Balance on Defaulted accounts')
print (round(df_subd.describe()),2)
print ('Median values')
print (df_subd.median())
print('------------------------------------------------------')
print('AGE and Limit Balance on Non-Defaulted accounts')
print (round(df_subnond.describe(),2))
print (df_subnond.median())
SEX EDUCATION MARRIAGE AGE LIMIT_BAL
count 30000.00 30000.00 30000.00 30000.00 30000.00
mean 1.60 1.84 1.56 35.49 167484.32
std 0.49 0.74 0.52 9.22 129747.66
min 1.00 1.00 1.00 21.00 10000.00
25% 1.00 1.00 1.00 28.00 50000.00
50% 2.00 2.00 2.00 34.00 140000.00
75% 2.00 2.00 2.00 41.00 240000.00
max 2.00 4.00 3.00 79.00 1000000.00
Median values for AGE and LIMIT_BAL
SEX 2.0
EDUCATION 2.0
MARRIAGE 2.0
AGE 34.0
LIMIT_BAL 140000.0
dtype: float64
------------------------------------------------------
AGE and Limit Balance on Defaulted accounts
SEX EDUCATION MARRIAGE AGE LIMIT_BAL
count 6636.0 6636.0 6636.0 6636.0 6636.0
mean 2.0 2.0 2.0 36.0 130110.0
std 0.0 1.0 1.0 10.0 115379.0
min 1.0 1.0 1.0 21.0 10000.0
25% 1.0 1.0 1.0 28.0 50000.0
50% 2.0 2.0 2.0 34.0 90000.0
75% 2.0 2.0 2.0 42.0 200000.0
max 2.0 4.0 3.0 75.0 740000.0 2
Median values
SEX 2.0
EDUCATION 2.0
MARRIAGE 2.0
AGE 34.0
LIMIT_BAL 90000.0
dtype: float64
------------------------------------------------------
AGE and Limit Balance on Non-Defaulted accounts
SEX EDUCATION MARRIAGE AGE LIMIT_BAL
count 23364.00 23364.00 23364.00 23364.00 23364.00
mean 1.61 1.83 1.56 35.42 178099.73
std 0.49 0.75 0.52 9.08 131628.36
min 1.00 1.00 1.00 21.00 10000.00
25% 1.00 1.00 1.00 28.00 70000.00
50% 2.00 2.00 2.00 34.00 150000.00
75% 2.00 2.00 2.00 41.00 250000.00
max 2.00 4.00 3.00 79.00 1000000.00
SEX 2.0
EDUCATION 2.0
MARRIAGE 2.0
AGE 34.0
LIMIT_BAL 150000.0
dtype: float64
#Checking the Data Quality
bank_db_df.groupby(['SEX','EDUCATION','MARRIAGE']).mean().head()
| ID | LIMIT_BAL | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SEX | EDUCATION | MARRIAGE | |||||||||||||||||||||
| 1 | 1 | 1 | 14549.651479 | 262485.207101 | 42.423077 | -0.220118 | -0.418935 | -0.459763 | -0.495266 | -0.521893 | -0.528402 | 60196.886391 | ... | 51360.938462 | 48348.436095 | 46303.767456 | 7430.248521 | 8108.434911 | 8130.040828 | 7016.669822 | 6897.355621 | 7466.460355 | 0.230769 |
| 2 | 14363.201671 | 192085.074060 | 31.704899 | -0.147360 | -0.266996 | -0.273073 | -0.320927 | -0.345993 | -0.362704 | 54064.399164 | ... | 46617.539309 | 42545.724269 | 40471.004558 | 6866.550323 | 6717.813141 | 6578.889480 | 5373.064565 | 5485.293581 | 6514.678314 | 0.192176 | ||
| 3 | 12009.709677 | 169677.419355 | 44.290323 | -0.064516 | -0.193548 | -0.032258 | -0.129032 | -0.129032 | -0.258065 | 62585.483871 | ... | 55666.483871 | 51097.419355 | 51780.483871 | 5226.161290 | 6973.129032 | 2610.225806 | 2631.677419 | 7977.032258 | 3407.709677 | 0.322581 | ||
| 2 | 1 | 15296.578903 | 162101.265823 | 40.332489 | 0.115612 | 0.057806 | 0.008439 | -0.063291 | -0.117722 | -0.177215 | 56465.218143 | ... | 47514.228270 | 44393.006329 | 43099.936287 | 5455.835021 | 5856.121519 | 4710.043460 | 4800.077637 | 4719.933333 | 4900.730802 | 0.271308 | |
| 2 | 14656.010204 | 112442.176871 | 31.584354 | 0.272789 | 0.239456 | 0.198639 | 0.113605 | 0.020748 | -0.038776 | 50926.275850 | ... | 41269.722449 | 37845.888435 | 36963.920748 | 4416.209524 | 4920.652041 | 4330.766667 | 4031.282993 | 3906.684694 | 4146.418027 | 0.253741 |
5 rows × 22 columns
#Creating New Dataframe to munpluate the Data for PCA
bank_db_df2 = bank_db_df
#bank_db_df2
#Creating a concatenated columns to see if we can draw more correlations
concat_df = bank_db_df2['SEX'].map(str) + bank_db_df2['EDUCATION'].map(str) + bank_db_df2['MARRIAGE'].map(str)
concat_df2 = bank_db_df2['SEX'].map(str) + bank_db_df2['EDUCATION'].map(str)
concat_df3 = bank_db_df2['SEX'].map(str) + bank_db_df2['MARRIAGE'].map(str)
concat_df4 = bank_db_df2['EDUCATION'].map(str) + bank_db_df2['MARRIAGE'].map(str)
# Added the Column and Changed the type to numaric vaule
bank_db_df2["S_E_M"] = concat_df
bank_db_df2["S_E"] = concat_df2
bank_db_df2["S_M"] = concat_df3
bank_db_df2["E_M"] = concat_df4
bank_db_df2["S_E_M"] = pd.to_numeric(bank_db_df2['S_E_M'])
bank_db_df2["S_E"] = pd.to_numeric(bank_db_df2['S_E'])
bank_db_df2["S_M"] = pd.to_numeric(bank_db_df2['S_M'])
bank_db_df2["E_M"] = pd.to_numeric(bank_db_df2['E_M'])
bank_db_df2["Bill_AVG"] = bank_db_df2[['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6']].mean(axis=1)
bank_db_df2["PAY_AMT_AVG"]= bank_db_df2[['PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']].mean(axis=1)
#bank_db_df2.dtypes
bank_db_df2.head()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month | S_E_M | S_E | S_M | E_M | Bill_AVG | PAY_AMT_AVG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 1 | 221 | 22 | 21 | 21 | 1284.000000 | 114.833333 |
| 1 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 1000 | 0 | 2000 | 1 | 222 | 22 | 22 | 22 | 2846.166667 | 833.333333 |
| 2 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 1000 | 1000 | 5000 | 0 | 222 | 22 | 22 | 22 | 16942.166667 | 1836.333333 |
| 3 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 1100 | 1069 | 1000 | 0 | 221 | 22 | 21 | 21 | 38555.666667 | 1398.000000 |
| 4 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 9000 | 689 | 679 | 0 | 121 | 12 | 11 | 21 | 18223.166667 | 9841.500000 |
5 rows × 31 columns
bank_db_df_corr = bank_db_df2.drop(["ID"],axis=1)
Utilizing a correlation plot we first identified attributes that showed strong relationships to those that defaulted. This was the highest for the history of past payments which recorded data on payment delays before defaulting. Payment delay data specifically at the earlier time frames showed the highest correlation to default data.
corrMatrix = bank_db_df_corr.corr()
sns.heatmap(corrMatrix, annot=True)
plt.rcParams.update({'font.size': 30})
plt.rcParams["figure.figsize"] = (110,110)
plt.savefig('CorrMatrix.png')
Graph showing correlation matix. When reviewing the graph we identified that only Pay 0 through Pay 6 had a high correlation to the variable of intrest of default payment next month.
#Created Subsets based on Correlation Matrix
df_sub_PAY = bank_db_df2[["default payment next month","PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]] #splitting attributes based on high correlation
df_sub_BILL = bank_db_df2[["default payment next month","BILL_AMT1","BILL_AMT2","BILL_AMT3","BILL_AMT4","BILL_AMT5","BILL_AMT6"]]
# Scatter Plots of the Pay Feilds
sns.pairplot(df_sub_PAY, hue="default payment next month", height=2)
plt.rcParams.update({'font.size': 15})
plt.rcParams["figure.figsize"] = (20,20)
#Scatter Plots of the Bill Feilds
sns.pairplot(df_sub_BILL, hue="default payment next month", height=2)
<seaborn.axisgrid.PairGrid at 0x1968e421a90>
#Create a boxplot for the summary statistics for the categorical variables based on AGE
age_bplt = sns.boxplot(x="default payment next month",y="AGE",hue="SEX",data=bank_db_df,
width=.5,)
age_bplt.set_ylim([0,85]) #set the y-axis range
(0.0, 85.0)
#Create a boxplot for the summary statistics for the categorical variables based on Limit Balance
LIMBAL_bplt = sns.boxplot(x="default payment next month",y="LIMIT_BAL",hue="SEX",data=bank_db_df,
width=.80)
LIMBAL_bplt.set_ylim([0,900000])#set the y-axis range
(0.0, 900000.0)
#Focused Scatter plot on Pay 0 and Pay 2 Columns
sns.scatterplot(data=df_sub_PAY, x="PAY_0", y="PAY_2", hue="default payment next month", size="default payment next month")
<AxesSubplot:xlabel='PAY_0', ylabel='PAY_2'>
#Focused Scatter plot on Pay 0 and Pay 2 Columns
sns.scatterplot(data=df_sub_BILL, x="BILL_AMT5", y="BILL_AMT6", hue="default payment next month")
<AxesSubplot:xlabel='BILL_AMT5', ylabel='BILL_AMT6'>
vars_to_plot_separate = [['BILL_AMT1'],['BILL_AMT2'],['BILL_AMT3'],['BILL_AMT4'],
['BILL_AMT5'],['BILL_AMT6'],['PAY_AMT1'],['PAY_AMT2'],
['PAY_AMT3'],['PAY_AMT4'],['PAY_AMT5'],['PAY_AMT6']]
for index, plot_vars in enumerate(vars_to_plot_separate):
plt.figure(figsize=(12,12))
ax = bank_db_df2.boxplot(column=plot_vars,by='default payment next month')
plt.show()
<ipython-input-58-bd3145546e1d>:5: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). plt.figure(figsize=(12,12))
<Figure size 864x864 with 0 Axes>
<Figure size 864x864 with 0 Axes>
<Figure size 864x864 with 0 Axes>
<Figure size 864x864 with 0 Axes>
<Figure size 864x864 with 0 Axes>
<Figure size 864x864 with 0 Axes>
<Figure size 864x864 with 0 Axes>
<Figure size 864x864 with 0 Axes>
<Figure size 864x864 with 0 Axes>
<Figure size 864x864 with 0 Axes>
<Figure size 864x864 with 0 Axes>
<Figure size 864x864 with 0 Axes>
Examining the differences in payment history was best viewed in a violin plot comparison that clearly indicates an increased likelihood of default for increased payment delays. This makes sense logically as increased delays in payment will most likely lead to a client defaulting. It is also noted that the later the client begins to delay payments also increases risk of default as the client is likely begins to struggle making payments after having made several previous payments, this especially increases as the delays are continued specifically at 3 months and beyond.
The does appear to be heavy skewness towards old customers and higher limit balances for both sexes. The median Age, Education, and Marriage status are the same for the total group and the filtered dataset with just defaulted accounts. When grouping by SEX, the average for males appear to be higher(see boxplot above) than the average female no matter the default status.
However, the median limit balance of $140,000 is higher than the $90,000 median limit balance for the defaulted group. Contrary to the average age between the two sex, the average limit balance for females is higher than males no matter the default status.
from sklearn.preprocessing import StandardScaler
features = ['PAY_0','PAY_2','PAY_3','PAY_4','PAY_5','PAY_6','BILL_AMT1',
'BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6',
'PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']
# Separating out the features
x = bank_db_df2.loc[:, features].values
# Separating out the target
y = bank_db_df2.loc[:,['default payment next month']].values
# Standardizing the features
x = StandardScaler().fit_transform(x)
### Utilized code from https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
from sklearn.decomposition import PCA
pca = PCA(n_components=9)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents
, columns = ['principal component 1', 'principal component 2',
'principal component 3','principal component 4',
'principal component 5','principal component 6',
'principal component 7','principal component 8',
'principal component 9'])
pca.explained_variance_ratio_
### Utilized code from https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
array([0.36180187, 0.20618472, 0.08362378, 0.05085274, 0.04933822,
0.04845462, 0.04349448, 0.04070514, 0.03800722])
finalDf = pd.concat([principalDf, bank_db_df2[['default payment next month']]], axis = 1)
finalDf.head()
| principal component 1 | principal component 2 | principal component 3 | principal component 4 | principal component 5 | principal component 6 | principal component 7 | principal component 8 | principal component 9 | default payment next month | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.774457 | -0.613834 | -0.633142 | -0.474713 | -0.033614 | -0.077654 | -0.051448 | 0.232366 | 3.055349 | 1 |
| 1 | -0.660334 | -2.051521 | 0.426726 | 0.113607 | 0.148504 | -0.100264 | 0.071387 | 0.134269 | -0.852330 | 1 |
| 2 | -0.766104 | -0.934008 | 0.023527 | -0.003720 | -0.099812 | -0.113338 | 0.035160 | 0.062651 | -0.217292 | 0 |
| 3 | -0.114276 | -0.627801 | -0.289996 | 0.133394 | -0.044445 | 0.052357 | -0.044003 | 0.025403 | -0.192321 | 0 |
| 4 | -0.851808 | 0.029702 | 0.965619 | 0.731632 | 0.121898 | 0.497529 | -0.258582 | 1.065251 | -0.624595 | 0 |
# Looking at the diffrencee of Pay 0 and Bill Amount 1
plt.scatter(bank_db_df2['PAY_0'],bank_db_df2['BILL_AMT1'],c=bank_db_df2['default payment next month'],label=bank_db_df2['default payment next month'])
<matplotlib.collections.PathCollection at 0x196876c0a30>
# Violin Plot of Pay 6 and Pay 5
f, ax = plt.subplots(figsize=(9, 9))
sns.violinplot(x="PAY_6", y="PAY_5", hue="default payment next month", data=bank_db_df2,
split=True, inner="quart")
<AxesSubplot:xlabel='PAY_6', ylabel='PAY_5'>
# Violin Plot of Pay 0 and Pay 2
f, ax = plt.subplots(figsize=(9, 9))
sns.violinplot(x="PAY_0", y="PAY_2", hue="default payment next month", data=bank_db_df2,
split=True, inner="quart")
<AxesSubplot:xlabel='PAY_0', ylabel='PAY_2'>
We created from existing categories based on Gender, Education and Marital Status. These were combined into new groupings listed below:
Gender/Education/Marital Status
Gender/Education
Gender/Marital Status
Education/Marital Status
This is to look if there was more than a binary relationship to defaulted payments being predicted. We tried this approach to see if we could provide more dimensionality to these categorical variables so that we could use them to show a higher correlation to our goal if we can predict default payments each month. In addition based on the correlation Matrix we could did create a new column that would Pay columns which are strongly correlated to one another. This action also was reproduced on the Bill Amount columns as well. This is to try to get a more highly correlated column to default payment next month column which is predicted value.
We utlized feature creation in this project to further explore the data. This was from combining more catagorical variables together to taking averages of several columns to create new coulmns in our data. This is in a effort to draw better results from the variables that are present. This provided futher insight into the work showing us that the catagorical variables of Gender, Education and Marital Status does have a large impact if someone will defult or not.
Through our EDA, we were not able to confidently identify strong predictors in predicting default payment of credit card borrowers. Therefore, we believe further transformations of the data set is appropriate to identify stronger correlations in order to predict our variable of interest . We will explore logarithmic changes and create additional interaction variables. Ultimately as the data stands we believe that a random forest model would be the best predictive model.The main reason is due to the low amount of present correlations to our identified target variable of default payment next month.